[BFT] Node ejected before epoch recovery #6632
base: feature/efm-recovery
Conversation
```
@@ -55,9 +56,6 @@ func GenerateRecoverEpochTxArgs(log zerolog.Logger,

	internalNodesMap := make(map[flow.Identifier]struct{})
	for _, node := range internalNodes {
		if !currentEpochIdentities.Exists(node.Identity()) {
```
This was originally put here to communicate an unusual state to operators when running the `efm-recover-args` command. I agree it should not cause an error, but maybe we can replace it with a WARN log. Then, when it is run as part of the CLI tool, we will still surface this info to operators.
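A minimal sketch of that replacement, assuming the zerolog logger already passed into `GenerateRecoverEpochTxArgs` and a `NodeID` field on the `internalNodes` elements (both inferred from the surrounding code, not confirmed by this diff):

```go
internalNodesMap := make(map[flow.Identifier]struct{})
for _, node := range internalNodes {
	if !currentEpochIdentities.Exists(node.Identity()) {
		// warn instead of erroring: surfaces the unusual state to operators
		// running the efm-recover-args CLI without aborting recovery
		log.Warn().
			Str("node_id", node.NodeID.String()).
			Msg("internal node is not in the current epoch identities")
		continue
	}
	internalNodesMap[node.NodeID] = struct{}{}
}
```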
```go
publicKeyShares := make([]crypto.PublicKey, 0, dkg.Size())
for i := uint(0); i < dkg.Size(); i++ {
	nodeID, err := dkg.NodeID(i)
```
In both instances where we currently use `DKG.NodeID`, we use it to construct a list of participant public key shares. Since the underlying `DKG` implementation (backed by `EpochCommit`) already contains this data in the `DKGParticipantKeys` field, why don't we replace the `DKG.NodeID` method with `DKG.AllKeyShares()`, which returns `EpochCommit.DKGParticipantKeys`?
Then, we can replace lines 63-74 with:

```go
publicKeyShares := dkg.AllKeyShares()
```
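A sketch of what the proposed method could look like. The wrapper type name below is hypothetical; per the comment above, the `EpochCommit`-backed implementation already holds the key shares in `DKGParticipantKeys`:

```go
import (
	"github.com/onflow/crypto" // import path may differ by flow-go version

	"github.com/onflow/flow-go/model/flow"
)

// dkgCommittee is a hypothetical stand-in for the EpochCommit-backed
// implementation of the protocol's DKG interface.
type dkgCommittee struct {
	commit *flow.EpochCommit
}

// AllKeyShares returns the random beacon public key shares of all DKG
// participants, ordered by DKG index. It simply exposes
// EpochCommit.DKGParticipantKeys, replacing the per-index NodeID lookup loop.
func (d *dkgCommittee) AllKeyShares() []crypto.PublicKey {
	return d.commit.DKGParticipantKeys
}
```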
Yeah, sounds good, I will add it.
```go
// prepare the staking public keys of participants
stakingKeys := make([]crypto.PublicKey, 0, len(allParticipants))
stakingBeaconKeys := make([]crypto.PublicKey, 0, len(allParticipants))
```
Suggested change:

```diff
-stakingBeaconKeys := make([]crypto.PublicKey, 0, len(allParticipants))
+beaconKeys := make([]crypto.PublicKey, 0, len(allParticipants))
```
In other parts of the code, we generally use the term "beacon key". Since what we are mainly differentiating it from is the "staking key", I think omitting "staking" from the name makes sense.
I think this is a special case: these are basically beacon keys that take part in staking. There is a separate `beaconKeys` further down the line that is used only for the random beacon protocol.
```go
	}

	stakingSigAggtor, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingKeys, msg, msig.ConsensusVoteTag)
	if err != nil {
		return nil, fmt.Errorf("could not create aggregator for staking signatures: %w", err)
	}

	dkg, err := f.committee.DKG(block.View)
	rbSigAggtor, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingBeaconKeys, msg, msig.RandomBeaconTag)
```
Suggested change:

```diff
-rbSigAggtor, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingBeaconKeys, msg, msig.RandomBeaconTag)
+beaconAggregator, err := signature.NewWeightedSignatureAggregator(allParticipants, stakingBeaconKeys, msg, msig.RandomBeaconTag)
```
Nit: reuse the "beacon" terminology from above, and expand "Aggtor" so it's pronounceable. (Feel free to ignore if you disagree.)
```go
	require.Equal(s.T(), events[0].Events[0].Type, eventType)

	// 5. Ensure the network transitions into the recovery epoch and finalizes the first view of the recovery epoch.
	startViewOfNextEpoch := uint64(txArgs[1].(cadence.UInt64))
```
Have you tried adding a conditional to wait for `endViewOfNextEpoch+1` and validate `s.AssertInEpoch(s.Ctx, 2)`? (That is, successfully complete the DKG and the epoch transition within the recovery epoch.) It would be nice to include this in one of the test cases if the runtime is manageable.
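A rough sketch of that extra assertion, assuming the suite exposes an `AwaitFinalizedView` helper and that the end view of the recovery epoch can be read from the transaction arguments (the argument index and timeout below are guesses for illustration):

```go
// wait until the recovery epoch has ended, then confirm the network
// completed the DKG and transitioned into the following epoch (counter 2)
endViewOfNextEpoch := uint64(txArgs[2].(cadence.UInt64)) // index assumed
s.AwaitFinalizedView(s.Ctx, endViewOfNextEpoch+1, 4*time.Minute, 500*time.Millisecond)
s.AssertInEpoch(s.Ctx, 2)
```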
I haven't tried it but I can add it.
```go
	// wait until the epoch setup phase to force network into EFM
	s.AwaitEpochPhase(s.Ctx, 0, flow.EpochPhaseSetup, 10*time.Second, 500*time.Millisecond)
```
Suggested change (remove these lines):

```diff
-// wait until the epoch setup phase to force network into EFM
-s.AwaitEpochPhase(s.Ctx, 0, flow.EpochPhaseSetup, 10*time.Second, 500*time.Millisecond)
```
In principle, stopping the Collection Node before entering the EpochSetup phase would be more reliable. The Consensus Nodes send off the first DKG message as soon as the EpochSetup phase starts, and can technically send their result submission after the DKG final view, so this could be race-prone.
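A sketch of the more deterministic ordering this suggests, assuming the suite's `AwaitEpochPhase` helper and a hypothetical `stopCollectionNode` helper standing in for the suite's container-control machinery:

```go
// stop the Collection Node while still in the staking phase, before any
// DKG messages are exchanged; without it, cluster QCs cannot be produced,
// so the network enters EFM deterministically once EpochSetup begins
s.AwaitEpochPhase(s.Ctx, 0, flow.EpochPhaseStaking, 10*time.Second, 500*time.Millisecond)
stopCollectionNode(s) // hypothetical helper
```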
```go
	// assert transition to second epoch did not happen
	// if counter is still 0, epoch emergency fallback was triggered as expected
	s.AssertInEpoch(s.Ctx, 0)
```
I think this mechanism for checking EFM dates to before we added `EpochPhaseFallback`. We could retrieve the snapshot (currently line 248) a bit earlier and assert that `snapshot.EpochPhase() == EpochPhaseFallback`, to make the test logic a little more explicit.
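A sketch of that more explicit check, assuming the snapshot retrieval the test already performs (the `GetLatestProtocolSnapshot` helper name and `EpochPhase()` signature are taken from the comment above and the suite, not from this diff):

```go
// retrieve the snapshot earlier and assert the fallback phase explicitly,
// instead of inferring EFM from the epoch counter alone
snapshot := s.GetLatestProtocolSnapshot(s.Ctx) // existing retrieval, moved up
epochPhase, err := snapshot.EpochPhase()
require.NoError(s.T(), err)
require.Equal(s.T(), flow.EpochPhaseFallback, epochPhase)

// the counter check remains as a secondary assertion
s.AssertInEpoch(s.Ctx, 0)
```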
Co-authored-by: Jordan Schalm <[email protected]>
While addressing comments, I found that the DKG is very inconsistent and flaky under this setup; I left more details here: #6729.
#6331
Context
This PR makes improvements to code clarity based on the parent branch and implements a new test which ejects a node before submitting the recovery transaction. This leads to a scenario where the DKG committee is a superset of the consensus committee when recovery takes place. To support this behavior, changes were made to the tooling, as well as to the vote-processing logic, to account for the inequality of the random beacon and consensus committees.
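As a rough illustration of what accounting for that inequality can look like in vote processing (a sketch under assumptions, not the PR's actual diff; names such as `beaconParticipants` are illustrative): when building the random beacon aggregator, only participants that hold a DKG key share may contribute beacon signatures, while every consensus participant can still contribute staking signatures.

```go
// Sketch: tolerate consensus participants that are missing from the DKG
// committee (and DKG members absent from consensus) instead of erroring.
stakingKeys := make([]crypto.PublicKey, 0, len(allParticipants))
beaconParticipants := make(flow.IdentitySkeletonList, 0, len(allParticipants))
beaconKeys := make([]crypto.PublicKey, 0, len(allParticipants))
for _, participant := range allParticipants {
	// every consensus participant contributes a staking key
	stakingKeys = append(stakingKeys, participant.StakingPubKey)

	// only participants present in the DKG contribute a beacon key share
	keyShare, err := dkg.KeyShare(participant.NodeID)
	if err != nil {
		continue // not in the DKG committee: staking signatures only
	}
	beaconParticipants = append(beaconParticipants, participant)
	beaconKeys = append(beaconKeys, keyShare)
}
```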